Parallel Sampling of HDPs using Sub-Cluster Splits
Abstract
We develop a sampling technique for hierarchical Dirichlet process (HDP) models. The parallel algorithm builds upon [1] by proposing large split and merge moves based on learned sub-clusters. These additional global split and merge moves drastically improve convergence in the experimental results. Furthermore, we find that cross-validation techniques do not adequately determine convergence, and that previous sampling methods converge more slowly than previously expected.
Similar resources
Supplemental Material for Parallel Sampling of HDPs using Sub-Cluster Splits
In the following supplemental material we provide some additional details and derivations for the paper. We begin by showing how to calculate the joint distribution p(β, z), marginalizing out π, in Section 1. Then, in Section 2, we examine the joint log-likelihoods of HDP topic models and show that the typical set of the distribution is very far from the mode. In Sections 3-4, we give...
Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed s...
Parallel Sampling of DP Mixture Models using Sub-Clusters Splits
We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed s...
Collapsed Gibbs Sampling for Latent Dirichlet Allocation on Spark
In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing, which has been the talk of the Big Data town for a while. It is suitable for iterative and interactive algorithms. Our approach splits the dataset into P ∗ P partitions, shuffles a...
Supplemental Material for Parallel Sampling of DP Mixture Models using Sub-Clusters Splits
In this section, we show the derivation of the posterior distribution over cluster weights, π, conditioned on the cluster labels, z. We begin with the definition of a Dirichlet process from [1]. Definition A.1 (Dirichlet Process). Let H be a measure on a measurable space, Ω. If for any finite partition, (A1, A2, · · · , AK) of the space, the measure, G, on the partition follows the following D...